Abstract

The effect of person misfit to an item response theory (IRT) model on a
mastery/non-mastery decision was investigated. Furthermore, it was investigated whether the
classification precision can be improved by identifying misfitting respondents using
person-fit statistics. A simulation study was conducted to investigate the probability of
a correct classification using different estimation methods, person-fit statistics, model
violations, test lengths and sample sizes. In this simulation study, the effect of the presence
of misfitting item score patterns on the item parameter estimates was also taken into
account. Results showed that the effect of the presence of misfitting item score patterns on
the classification of non-aberrant simulees was in general small, that is, the classification
precision for these simulees hardly suffered. Further, for simulees classified as
non-aberrant using a person-fit statistic, the classification decisions were comparable with
those for a priori known non-aberrant simulees. The conclusion is that person-fit statistics can be
used for identifying a sub-sample of respondents for whom relatively precise
mastery/non-mastery decisions can be made. These results were comparable across different person-fit
statistics and estimation methods.

Key Words: IRT, 3PNO, Guessing, Item disclosure, Person-fit statistics, Estimation
methods 
Introduction

To determine the fit of an item score pattern to an item response theory (IRT;
Lord, 1980; van der Linden & Hambleton, 1997) model, several fit statistics have been
proposed. Investigating the fit of an item score pattern to an IRT model may help the
researcher to obtain additional information about the response behavior of a person, which 
may, for instance, be influenced by guessing, or preknowledge of the items. For an 
overview of person-fit research, see Meijer and Sijtsma (2001). From this overview 
it can be concluded that many person-fit studies have concentrated on the power of 
person-fit methods to detect misfitting item score patterns. In simulation studies, the 
percentage of correctly classified misfitting item score patterns is investigated given a 
priori knowledge of some type of misfitting response behavior. A general conclusion 
from these studies is that the power of the person-fit statistics is relatively low (Meijer 
& Sijtsma, 2001). However, relatively low power of a person-fit statistic should be
interpreted in relation to the effect on the estimation of the latent trait $\theta$ or the classification
of a person in a prespecified category (for example, when making mastery/non-mastery
decisions). Therefore, knowledge about the effect of misfit of an item score pattern on the
classification is crucial for the use of person-fit statistics in an applied setting. Knowing what type
of misfit has an effect on classification decisions may help the researcher to test against
specific types of aberrant behavior. In this paper, we will first investigate the robustness
of the classification decision under different types of misfit using different methods to
estimate $\theta$.

In realistic situations, it is unknown which respondents are aberrant. Therefore, it 
must be expected that the item parameter estimates will be biased by the contamination 
of the data. The impact of the presence of non-fitting response patterns on the outcome
of the person-fit tests and on the classification decisions will be the second topic of the
investigation. More specifically, it will be investigated whether person-fit statistics can be
used for identifying a sub-sample of respondents for whom relatively precise
mastery/non-mastery decisions can be made.
Three estimation methods for $\theta$ will be used: a Maximum Likelihood Estimation
(MLE) method and two Bayesian methods. The first Bayesian method is based on an
expected a posteriori (EAP) estimate of $\theta$ given estimates of the item parameters; the
second Bayesian method is based on the complete posterior distribution of $\theta$ and the
item parameters. The latter method has the advantage that both the uncertainty about
the person and the uncertainty about the item parameters are taken into account. The
computations for the latter procedure are performed using a Markov Chain Monte Carlo
(MCMC) algorithm.

This paper is organized as follows. First, we will introduce the relevant IRT model
and some methods to estimate $\theta$. Second, we will discuss person-fit statistics that are
often used in practice. Third, we will present the results of a simulation study in which we
investigate the effect of person misfit and the performance of person-fit tests for different
levels of misfit, test lengths, sample sizes, estimation methods and test statistics.

IRT Model 

In IRT, the probability of a correct response on item $j$ ($j = 1, 2, \ldots, k$), $P_j(\theta)$, is
a function of the latent trait value $\theta$ and a number of item characteristics. Models that
are most often used are the one-, two-, and three-parameter logistic models (1-, 2-, and
3-PL; Hambleton and Swaminathan, 1985, pp. 35-48). In this study, however, we will
use the 3-parameter normal ogive (3PNO; Lord, 1980, pp. 13-14) model, because in
a Bayesian framework, the 3PNO model has some computational advantages over the
logistic models (see, for example, Albert, 1992). The 3PNO model and 3PL model
are completely equivalent for all practical purposes. In the 3PNO model, the item is
characterized by a difficulty parameter $\beta_j$, a discrimination parameter $\alpha_j$, and a (pseudo)
guessing probability $\gamma_j$. The probability of correctly answering an item is given by




$$P_j(\theta) = \gamma_j + (1 - \gamma_j)\,\Phi(\alpha_j \theta - \beta_j) \tag{1}$$
where $\Phi$ is the standard normal cumulative distribution function. Since the 3PNO and 3PL models
predict nearly identical item response functions (IRFs), few differences in either model
fit or parameter estimates are expected (Embretson & Reise, 2000, pp. 78-79).
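
As a minimal illustration of Equation (1), the following Python sketch evaluates the 3PNO
response probability using only the standard library. The function name `p_correct_3pno`
and the example parameter values are assumptions made for illustration; they are not taken
from the study.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution; .cdf() plays the role of Phi


def p_correct_3pno(theta, alpha, beta, gamma):
    """Probability of a correct response under the 3PNO model, Equation (1).

    theta: latent trait, alpha: discrimination, beta: difficulty,
    gamma: (pseudo) guessing probability.
    """
    return gamma + (1.0 - gamma) * _STD_NORMAL.cdf(alpha * theta - beta)


# Illustrative values only: an item with alpha = 1, beta = 0 and gamma = 0.20.
p_mid = p_correct_3pno(theta=0.0, alpha=1.0, beta=0.0, gamma=0.20)
p_low = p_correct_3pno(theta=-6.0, alpha=1.0, beta=0.0, gamma=0.20)
print(round(p_mid, 3), round(p_low, 3))
```

For very low $\theta$ the probability approaches the guessing parameter $\gamma_j$ rather than
zero, which is the feature that distinguishes the three-parameter model from the
two-parameter model.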



Methods for Estimating $\theta$



MLE 

In IRT, the probability of a correct response on an item depends on $\theta$ and the
parameters that characterize the item. Both $\theta$ and the item parameters are unknown.
Almost all person-fit studies have used the MLE method to estimate $\theta$ (Meijer & Sijtsma,
1995). The MLE estimator $\hat{\theta}$ is the value of $\theta$ that maximizes the likelihood function for a
particular response pattern. It is usually assumed that the items fit the IRT model and that
the item parameters are known. An advantage of MLE is that $\hat{\theta}$ tends to be consistent
and efficient (Hambleton & Swaminathan, 1985). The main disadvantages of MLE are
that $\hat{\theta}$ does not exist for a perfect item score pattern (all items correct) or for a pattern with all
items incorrect, and that $\hat{\theta}$ is biased, that is, it is overestimated for positive values of $\theta$ and
underestimated for negative values of $\theta$ (Lord, 1983; see Warm, 1989, for improvements).
Although it is assumed that the item parameters are known, in realistic situations they are
unknown and have to be estimated. In the simulation studies reported below, the item
parameters will be estimated using a maximum marginal likelihood (MML) procedure
implemented in BILOG-MG. For more information about the MLE procedure for the normal
ogive model, refer to Baker (1992).
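
The MLE idea can be sketched in a few lines of Python. The sketch below treats the item
parameters as known and maximizes the 3PNO log-likelihood by a plain grid search on
$[-4, 4]$; the grid search, the bounds, and the 10-item test are assumptions chosen for
brevity and are not the procedure used by BILOG-MG.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf


def log_likelihood(theta, y, alphas, betas, gammas):
    """Log-likelihood of item score pattern y at theta under the 3PNO model."""
    total = 0.0
    for y_j, a, b, g in zip(y, alphas, betas, gammas):
        p = g + (1.0 - g) * PHI(a * theta - b)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total += math.log(p) if y_j == 1 else math.log(1.0 - p)
    return total


def mle_theta(y, alphas, betas, gammas, lo=-4.0, hi=4.0, steps=800):
    """Grid-search MLE of theta on [lo, hi]; item parameters are treated as known."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda t: log_likelihood(t, y, alphas, betas, gammas))


# Hypothetical 10-item test with difficulties -2.0, -1.6, ..., 1.6.
betas = [-2.0 + 0.4 * i for i in range(10)]
alphas = [1.0] * 10
gammas = [0.2] * 10

theta_mixed = mle_theta([1] * 5 + [0] * 5, alphas, betas, gammas)
theta_high = mle_theta([1] * 8 + [0] * 2, alphas, betas, gammas)
# For a perfect pattern the likelihood keeps increasing, so the search
# stops at the upper bound of the grid instead of at an interior maximum.
theta_perfect = mle_theta([1] * 10, alphas, betas, gammas)
print(theta_mixed, theta_high, theta_perfect)
```

The perfect pattern illustrates the non-existence problem mentioned above: the reported
value is only the edge of the search interval, not a genuine maximum.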

EAP 

As Bayesian alternatives to MLE in person-fit research, Reise (1995) used EAP
estimation and Glas and Meijer (2001) used MCMC estimation. In EAP estimation,
the response vector is combined with prior information about the examinees. The posterior
distribution is proportional to the product of the likelihood of the item score pattern given
$\theta$ and a (usually normal) prior for $\theta$. The EAP estimate is simply the mean of the posterior
distribution. An advantage of EAP estimation is that the extra information obtained using
the prior can improve the estimation of $\theta$, and unreasonable values of $\hat{\theta}$ can be avoided.
This, of course, has the side effect that the resulting $\hat{\theta}$ will regress to the mean of the
prior (shrinkage). In EAP estimation, too, the item parameters are imputed as fixed
known constants. Below, values for these constants are estimated using the Bayes modal
procedure implemented in BILOG-MG (Mislevy & Bock, 1990).
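
A minimal sketch of EAP estimation, assuming a standard normal prior and a simple
equally spaced grid in place of a proper quadrature rule, is given below. The function
names, the grid, and the 10-item test are illustrative assumptions.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def likelihood(theta, y, alphas, betas, gammas):
    """Likelihood of item score pattern y at theta under the 3PNO model."""
    value = 1.0
    for y_j, a, b, g in zip(y, alphas, betas, gammas):
        p = g + (1.0 - g) * STD_NORMAL.cdf(a * theta - b)
        value *= p if y_j == 1 else (1.0 - p)
    return value


def eap_theta(y, alphas, betas, gammas, lo=-4.0, hi=4.0, n_points=161):
    """Posterior mean of theta on an equally spaced grid, standard normal prior."""
    grid = [lo + (hi - lo) * q / (n_points - 1) for q in range(n_points)]
    weights = [
        likelihood(t, y, alphas, betas, gammas) * STD_NORMAL.pdf(t) for t in grid
    ]
    return sum(t * w for t, w in zip(grid, weights)) / sum(weights)


# Hypothetical 10-item test with difficulties -2.0, -1.6, ..., 1.6.
betas = [-2.0 + 0.4 * i for i in range(10)]
alphas = [1.0] * 10
gammas = [0.2] * 10

eap_none = eap_theta([0] * 10, alphas, betas, gammas)
eap_mixed = eap_theta([1] * 5 + [0] * 5, alphas, betas, gammas)
eap_perfect = eap_theta([1] * 10, alphas, betas, gammas)
print(eap_none, eap_mixed, eap_perfect)
```

Unlike the MLE, the EAP estimate stays finite for the all-correct and all-incorrect
patterns, because the prior pulls the estimate toward zero.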

MCMC 

Whereas procedures for conventional frequentist statistical inference focus attention 
on point estimates and their standard errors, Bayesian methods often seek to characterize 
the posterior distribution of the parameters. This can be done by an MCMC method, which
produces samples from the joint posterior density of the model parameters that may then be
summarized to estimate $\theta$ (Jackman, 2000). Because this technique is somewhat less well
known than the MLE and EAP methods, we will discuss it in more detail.

Below, the MCMC method used will be the Gibbs sampler (Geman & Geman,
1984; Gelfand & Smith, 1990). To implement the Gibbs sampler, the parameter vector
is divided into a number of components, and each successive component is sampled
from its conditional distribution given sampled values for all other components. This
sampling scheme is repeated until the sampled values form stable estimates of the
posterior distributions. Albert (1992) applies Gibbs sampling to estimate the parameters
of the well-known 2PNO model; a generalization to the 3PNO model is given by Beguin
and Glas (in press). The latter generalization entails a data augmentation scheme defined
as follows. Let the binary variable $W_{ij}$ be defined as

$$W_{ij} = \begin{cases} 1 & \text{if person } i \text{ knows the correct answer to item } j \\ 0 & \text{if person } i \text{ doesn't know the correct answer to item } j \end{cases} \tag{2}$$

The relation between $W_{ij}$ and the observed response variable $Y_{ij}$ is given by a model
where $\Phi(\eta_{ij})$, with $\eta_{ij} = \alpha_j \theta_i - \beta_j$, is the probability that the respondent knows the item
and gives a correct response with probability one, and $1 - \Phi(\eta_{ij})$ is the probability that
the respondent does not know the item and guesses with $\gamma_j$ as the probability of a correct
response. The data are also augmented with latent data $Z_{ij}$, which are independent and
normally distributed with mean $\eta_{ij}$ and a standard deviation equal to one. These variables
are related to $W$ by $Z_{ij} > 0$ if $W_{ij} = 1$ and $Z_{ij} < 0$ if $W_{ij} = 0$. The aim of the procedure
is to simulate samples from the joint posterior distribution $p(\alpha, \beta, \gamma, \theta, z, w \mid y)$, where
the data $y$ are the responses of $n$ simulees to $k$ items. The procedure to calculate the
posterior distribution consists of the following steps:

1. Draw from the posterior $p(z, w \mid y, \alpha, \beta, \gamma, \theta)$ via the data augmentation model;

2. Draw from the conditional distribution of $\theta$ given $z$ and $\alpha$, $\beta$ via a normal regression
model;

3. Draw from the conditional distribution of the parameters of item $j$, $\alpha_j$ and $\beta_j$, via a
normal regression model;

4. Draw from the conditional distribution of $\gamma_j$, which is a beta distribution when the
conjugate beta prior is used.

Convergence is evaluated by comparing the between- and within-sequence variance.
Bayes modal estimates obtained using a standard software package such as BILOG-MG can
be used as starting points. The estimate $\hat{\theta}$ is taken as the average of the draws in Step 2.
So also in this case, the point estimate of $\theta$ is an expected a posteriori estimate. For more
information on this algorithm refer to Albert (1992) and Beguin and Glas (in press).
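
To make Step 1 concrete, the sketch below draws one pair $(w, z)$ for a single response.
It is an illustration under stated assumptions, not the authors' implementation: the
probability $P(W = 1 \mid Y = 1)$ is derived here from the model description by Bayes' rule,
the truncated normal draw uses inversion of the normal CDF, and the function name
`draw_w_z` is hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def draw_w_z(y, eta, gamma, rng):
    """One data augmentation draw of (w, z) for a single response.

    y: observed 0/1 response, eta = alpha_j * theta_i - beta_j,
    gamma: guessing probability of the item, rng: random.Random instance.
    """
    phi_eta = STD_NORMAL.cdf(eta)
    if y == 1:
        # Bayes' rule: P(W = 1 | Y = 1) = Phi / (Phi + gamma * (1 - Phi)).
        p_know = phi_eta / (phi_eta + gamma * (1.0 - phi_eta))
        w = 1 if rng.random() < p_know else 0
    else:
        w = 0  # an incorrect response implies the answer was not known

    # Z ~ N(eta, 1), truncated to positive values if w == 1 and to
    # negative values otherwise, drawn by inverting the normal CDF.
    cut = STD_NORMAL.cdf(-eta)  # P(Z <= 0)
    u = rng.random()
    q = cut + u * (1.0 - cut) if w == 1 else u * cut
    q = min(max(q, 1e-12), 1.0 - 1e-12)  # keep inv_cdf inside (0, 1)
    z = eta + STD_NORMAL.inv_cdf(q)
    # Numerical guard for extreme eta, where the clamp above could flip the sign.
    z = max(z, 0.0) if w == 1 else min(z, 0.0)
    return w, z


rng = random.Random(1)
draws = [draw_w_z(1, 0.0, 0.2, rng) for _ in range(20000)]
share_known = sum(w for w, _ in draws) / len(draws)
print(round(share_known, 3))
```

With $\eta_{ij} = 0$ and $\gamma_j = 0.2$, a correct response is attributed to knowledge with
probability $0.5 / (0.5 + 0.2 \cdot 0.5) = 5/6$, which the simulated share should approximate.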

Person-Fit Statistics 

Several person-fit statistics for investigating the goodness of fit of item score patterns 
have been proposed. In this paper, we will use a number of statistics that have been 
most often used in the literature (Meijer & Sijtsma, 2001). Glas and Meijer (2001) found 
that these statistics had an acceptable type I error rate when simulating the distribution 
for these statistics using an MCMC method. The Type I error rate is the proportion of
true null hypotheses that are incorrectly rejected by the statistical tests. The following statistics
were used. 

The W statistic (Wright and Stone, 1979) is defined by

$$W = \frac{\sum_{j=1}^{k} [Y_j - P_j(\theta)]^2}{\sum_{j=1}^{k} P_j(\theta)[1 - P_j(\theta)]} \tag{3}$$

where the difference between the item score $Y_j$ and the expected item score $P_j(\theta)$ is
weighted by the variance of the item score.
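
A short sketch of the W statistic follows. It assumes the usual form of the statistic, the
sum of squared residuals $Y_j - P_j(\theta)$ divided by the sum of the item score variances
$P_j(\theta)[1 - P_j(\theta)]$; the probabilities in the example are hypothetical.

```python
def w_statistic(y, p):
    """W = sum_j (y_j - p_j)^2 / sum_j p_j * (1 - p_j)."""
    numerator = sum((y_j - p_j) ** 2 for y_j, p_j in zip(y, p))
    denominator = sum(p_j * (1.0 - p_j) for p_j in p)
    return numerator / denominator


# Hypothetical expected item scores P_j(theta), ordered from easy to difficult.
p = [0.9, 0.7, 0.5, 0.3]
w_fit = w_statistic([1, 1, 0, 0], p)     # easy items correct, difficult items wrong
w_misfit = w_statistic([0, 0, 1, 1], p)  # the reversed, unexpected pattern
print(round(w_fit, 3), round(w_misfit, 3))
```

The reversed pattern produces the larger value, so large values of W point to misfit.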

A related statistic was proposed by Smith (1985, 1986), where the set of test items is
divided into $S$ non-overlapping subtests denoted $A_s$ ($s = 1, \ldots, S$). Then the unweighted
between-sets fit statistic UB is defined as

$$UB = \frac{1}{S - 1} \sum_{s=1}^{S} \frac{\left[\sum_{j \in A_s} \left(Y_j - P_j(\theta)\right)\right]^2}{\sum_{j \in A_s} P_j(\theta)[1 - P_j(\theta)]} \tag{4}$$

The UB statistic is a weighted W statistic computed at the subtest level.

Two other statistics, $\zeta_1$ and $\zeta_2$, were proposed by Tatsuoka (1984). The $\zeta_1$ statistic is
the standardization with a mean of 0 and unit variance of

$$\sum_{j=1}^{k} [P_j(\theta) - Y_j](n_j - \bar{n}) \tag{5}$$

where $n_j$ denotes the number of correct answers to item $j$ and $\bar{n}$ denotes the mean number
of correctly answered items in the test. The index will be positive when easy items are
incorrectly answered and difficult items are correctly answered, and it will also be positive
if the number of correctly answered items deviates from the overall mean score of the
respondents. If a response pattern is misfitting in both senses, the magnitude of the index
will be largely positive. The $\zeta_2$ statistic is a standardization of

$$\sum_{j=1}^{k} [P_j(\theta) - Y_j]\left[P_j(\theta) - \frac{R}{k}\right] \tag{6}$$

where $R$ is the person's number-correct score on the test. The index will be positive if
the response pattern is misfitting in the sense that easy items are incorrectly answered
and difficult items are correctly answered; the overall response tendencies of the total
sample of persons are not important here.

Another well-known person-fit statistic is the log-likelihood statistic

$$l = \sum_{j=1}^{k} \left\{Y_j \ln P_j(\theta) + (1 - Y_j) \ln[1 - P_j(\theta)]\right\} \tag{7}$$

which was first proposed by Levine and Rubin (1979). It was further developed in
Drasgow, Levine, and Williams (1985), and Drasgow, Levine, and McLaughlin (1991).

Drasgow et al. (1985) proposed a standardized version $l_z$ of $l$ which is asymptotically
standard normally distributed; $l_z$ is defined as

$$l_z = \frac{l - E(l)}{[\mathrm{Var}(l)]^{1/2}} \tag{8}$$

where $E(l)$ and $\mathrm{Var}(l)$ denote the expectation and the variance of $l$, respectively. The
person-fit statistic $l_z$ is often used, but Molenaar and Hoijtink (1990) and Van Krimpen-Stoop
and Meijer (1999) showed that the distribution of $l_z$ is negatively skewed. This
skewness influences the differences between nominal and empirical Type I error rates for
small Type I error values. They found that increasing the item discrimination resulted in a
distribution that was more negatively skewed. In an MCMC framework, the distribution
of a statistic is simulated, so its skewness is of minor importance. Therefore, we will only
consider the person-fit statistic $l$ instead of $l_z$.
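
For reference, Equations (7) and (8) can be sketched as follows. The expressions used for
$E(l)$ and $\mathrm{Var}(l)$ are not spelled out in the text; the sketch assumes the standard
conditional moments for dichotomous items,
$E(l) = \sum_j \{P_j \ln P_j + (1 - P_j) \ln(1 - P_j)\}$ and
$\mathrm{Var}(l) = \sum_j P_j (1 - P_j) [\ln(P_j / (1 - P_j))]^2$, and hypothetical probabilities.

```python
import math


def log_likelihood_stat(y, p):
    """Person-fit statistic l of Equation (7)."""
    return sum(
        math.log(p_j) if y_j == 1 else math.log(1.0 - p_j) for y_j, p_j in zip(y, p)
    )


def lz_statistic(y, p):
    """Standardized statistic l_z of Equation (8), using the assumed moments."""
    l = log_likelihood_stat(y, p)
    expected = sum(
        p_j * math.log(p_j) + (1.0 - p_j) * math.log(1.0 - p_j) for p_j in p
    )
    variance = sum(
        p_j * (1.0 - p_j) * math.log(p_j / (1.0 - p_j)) ** 2 for p_j in p
    )
    return (l - expected) / math.sqrt(variance)


p = [0.8, 0.6, 0.3]  # hypothetical P_j(theta)
l_fit = log_likelihood_stat([1, 1, 0], p)
lz_fit = lz_statistic([1, 1, 0], p)
lz_misfit = lz_statistic([0, 0, 1], p)
print(round(l_fit, 4), round(lz_fit, 3), round(lz_misfit, 3))
```

Low (negative) values of $l$ and $l_z$ indicate an unlikely pattern, so the direction of
"extreme" is opposite to that of W.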



Evaluating the fit of an item score pattern. 

To evaluate the fit of an item score pattern, a norm distribution is needed for
classifying an item score pattern as fitting or misfitting. This norm distribution can
be obtained using a theoretical distribution (e.g., a normal distribution) or it can be
simulated. In this paper, we will simulate the norm distribution because often the
theoretical distributions proposed in the literature are not in agreement with the
empirical distributions (Meijer & Sijtsma, 2001). Also, the error in the item and person
parameters can be taken into account when we simulate the norm distribution. Recently,
Glas and Meijer (2001) used a Bayesian approach. Their approach has the advantage
that it takes into account the uncertainty of the parameters in the IRT model. In this
Bayesian method, the posterior distribution of the parameters of the 3PNO model, say
$p(\xi \mid y)$, where $\xi$ are the item and person parameters in the model and $y$ is the observed data,
is simulated using the MCMC method given above. Person fit is then evaluated using a
posterior predictive check based on an index $T(y, \xi)$, where $T$ refers to the person-fit
statistics given above. When the Markov chain has converged, draws from the posterior
distribution can be used to generate model-conform data $y^{rep}$ and to compute the Bayes
p-value defined by



$$\text{Bayes } p\text{-value} = \Pr\left(T(y^{rep}, \xi) > T(y, \xi) \mid y\right) \tag{9}$$

Thus, the Bayes p-value is defined as the probability that the replicated data are more
extreme than the observed data. Posterior predictive checks are performed by inserting
the person-fit statistics $l$, $W$, $UB$, $\zeta_1$, and $\zeta_2$ into equation (9). After the burn-in period,
when the Markov chain has converged, in every $n$-th iteration ($n > 1$), using the current
draw of the item and person parameters, a person-fit statistic $T(y, \xi)$ is computed, a new
model-conform response pattern is generated, and a value $T(y^{rep}, \xi)$ is computed. Finally,
a Bayes p-value is computed as the proportion of iterations where $T(y^{rep}, \xi) > T(y, \xi)$.
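
The posterior predictive check can be sketched as below. The sketch assumes that each
retained MCMC draw has already been converted into a vector of response probabilities
$P_j$ for the person in question; for simplicity the example reuses one hypothetical
probability vector for all 600 draws, which ignores parameter uncertainty. The W statistic
is used as $T$, and all names are illustrative.

```python
import random


def w_statistic(y, p):
    """W = sum_j (y_j - p_j)^2 / sum_j p_j * (1 - p_j)."""
    numerator = sum((y_j - p_j) ** 2 for y_j, p_j in zip(y, p))
    denominator = sum(p_j * (1.0 - p_j) for p_j in p)
    return numerator / denominator


def bayes_p_value(y, posterior_draws, statistic, rng):
    """Proportion of draws in which T(y_rep, xi) exceeds T(y, xi).

    posterior_draws: one vector of response probabilities per retained draw
    (assumed to be supplied by the sampler).
    """
    exceed = 0
    for p in posterior_draws:
        # Generate a model-conform replicated pattern from the current draw.
        y_rep = [1 if rng.random() < p_j else 0 for p_j in p]
        if statistic(y_rep, p) > statistic(y, p):
            exceed += 1
    return exceed / len(posterior_draws)


rng = random.Random(2024)
draws = [[0.95, 0.9, 0.8, 0.3, 0.2, 0.1]] * 600  # hypothetical, identical draws
p_conform = bayes_p_value([1, 1, 1, 0, 0, 0], draws, w_statistic, rng)
p_aberrant = bayes_p_value([0, 0, 0, 1, 1, 1], draws, w_statistic, rng)
print(p_conform, p_aberrant)
```

A pattern would be flagged as misfitting when its Bayes p-value falls below the nominal
significance level; here the fully reversed pattern cannot be exceeded by any replicated
pattern, so its p-value is zero.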

Simulation Studies 

The objective of these studies was to assess the probability of correctly classifying a
person according to his or her $\theta$ value. A simulee was classified as a master if $\hat{\theta} > 0$ and as
a non-master if $\hat{\theta} \le 0$. The cut-off score was chosen equal to zero. A classification error
arises when the sign of the generating value of $\theta$ is not equal to the sign of the estimate.
The reason for choosing a cut-off score equal to zero is to minimize the effect of the bias
of $\hat{\theta}$. For example, when an EAP estimator is used to estimate $\theta$, the estimate shrinks
towards the mean of $\theta$, which is assumed to be equal to zero (Mislevy, 1986). Thus, the
bias is smallest at the mean of zero.
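
The classification criterion can be written down directly. The sketch below computes the
two conditional proportions of correct decisions for a cut-off of zero; the function name
and the five example values are hypothetical.

```python
def classification_rates(theta_true, theta_hat, cutoff=0.0):
    """Return (P(est > c | true > c), P(est <= c | true <= c))."""
    masters = [est for true, est in zip(theta_true, theta_hat) if true > cutoff]
    non_masters = [est for true, est in zip(theta_true, theta_hat) if true <= cutoff]
    p_master = sum(est > cutoff for est in masters) / len(masters)
    p_non_master = sum(est <= cutoff for est in non_masters) / len(non_masters)
    return p_master, p_non_master


# Hypothetical generating values and estimates for five simulees.
theta_true = [0.5, 1.2, -0.3, -1.0, 0.1]
theta_hat = [0.4, -0.1, -0.2, 0.3, 0.2]
p_master, p_non_master = classification_rates(theta_true, theta_hat)
print(p_master, p_non_master)
```

In this toy example two of the three true masters and one of the two true non-masters are
classified correctly.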

Model violations 

Guessing 

To investigate the robustness of the classification decision, random guessing was
simulated for a subset of the simulees in the data on part of the items in the test. For that
part of the test, these simulees responded randomly, with the probability of a correct
score equal to 0.20. Random response behavior may result from a lack of interest in the
test in situations where, for example, a test is used to evaluate educational achievement
without consequences for the individual student. We simulated guessing on the easiest items,
because there the effect of guessing on the estimate of ability is most detrimental (Meijer
& Nering, 1997). Note that when guessing occurs only on the easy items, the actual score
will be lower than expected. This implies that the resulting ability estimate will also be
lower than the 'true' ability predicted by the model.

Item Disclosure 

It is possible that a person has preknowledge of some of the items in the test, either 
about the type of test questions or about the correct answers, for example, as a result 
of repeated test taking. Item disclosure may result in a larger percentage of the correct 
answers than expected. 

Note that in general it is unknown on which of the items and on how many items
a person has knowledge of the correct answers. Item preknowledge on a few items
will only have a minor effect on the number-correct score (Meijer & Nering, 1997).
Also, item preknowledge of the correct answers on the easiest items in the test will
only slightly improve the number-correct score. This suggests that preknowledge on the
items of medium and high difficulty may have the most profound effect on the total score.
Therefore, in this paper, we will always assume that item disclosure will affect only the
difficult items. For these items, simulees will get a correct score with a probability equal
to 0.80.

When item disclosure affects only the difficult items, the actual score will be higher
than expected. Thus, the effect of item preknowledge is the largest for persons with low
$\theta$ who answer many difficult items correctly. For these persons, the resulting $\hat{\theta}$ will be
higher than the 'true' $\theta$ predicted by the model.
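
Both model violations can be simulated with one small routine. The sketch below is an
illustration under assumptions: the affected fraction is taken as a share of all $k$
items, the affected items are the easiest (guessing) or the most difficult (item
disclosure) ones, and the function and argument names are hypothetical.

```python
import random
from statistics import NormalDist

PHI = NormalDist().cdf


def simulate_pattern(theta, alphas, betas, gammas, rng, violation=None, fraction=0.0):
    """Simulate one item score pattern under the 3PNO model.

    violation: None, "guessing" (easiest items answered with P = 0.20) or
    "disclosure" (most difficult items answered with P = 0.80).
    fraction: assumed here to be the share of all k items that is affected.
    """
    k = len(betas)
    order = sorted(range(k), key=lambda j: betas[j])  # easiest items first
    n_affected = round(fraction * k) if violation else 0
    if violation == "guessing":
        affected, p_aberrant = set(order[:n_affected]), 0.20
    elif violation == "disclosure":
        affected, p_aberrant = set(order[k - n_affected:]), 0.80
    else:
        affected, p_aberrant = set(), None

    pattern = []
    for j in range(k):
        if j in affected:
            p = p_aberrant
        else:
            p = gammas[j] + (1.0 - gammas[j]) * PHI(alphas[j] * theta - betas[j])
        pattern.append(1 if rng.random() < p else 0)
    return pattern


# A 30-item design: three discriminations crossed with ten difficulties.
betas = [round(-2.0 + 0.4 * i, 2) for i in range(10)] * 3
alphas = [0.5] * 10 + [1.0] * 10 + [1.5] * 10
gammas = [0.2] * 30

rng = random.Random(7)
normal = simulate_pattern(1.0, alphas, betas, gammas, rng)
guesser = simulate_pattern(1.0, alphas, betas, gammas, rng, "guessing", 1 / 2)
cheater = simulate_pattern(-1.0, alphas, betas, gammas, rng, "disclosure", 1 / 2)
print(sum(normal), sum(guesser), sum(cheater))
```

Guessing on the easy items lowers the number-correct score of an able simulee, whereas
disclosure of the difficult items raises the score of a less able simulee, which is what
drives the misclassifications studied below.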

Simulation Methods 

Data Generation 

The item parameters were chosen as follows. The $\gamma$-parameter was fixed to 0.20 for
all items. Item difficulty and discrimination parameters were chosen as follows:

• for a test length k = 30, three values of the discrimination parameter, 0.5, 1.0, and 1.5,
were crossed with ten item difficulties $\beta_i = -2.00 + 0.40(i - 1)$, $i = 1, \ldots, 10$;

• for a test length k = 60, three values of the discrimination parameter, 0.5, 1.0, and 1.5,
were crossed with twenty item difficulties $\beta_i = -2.00 + 0.20(i - 1)$, $i = 1, \ldots, 20$.

The ability parameters were drawn from a standard normal distribution. Using these item
and ability parameters, data were generated, and the parameters were estimated both using
BILOG-MG followed by MLE and EAP, and using the MCMC method. Then, the item score
patterns were classified as normal or aberrant using the person-fit statistics discussed
above. Because we were interested in the robustness of the classification decision of
an individual person under model violations, we computed the proportion of correctly
identified simulees with $\hat{\theta} > 0$ among those with $\theta > 0$ [i.e., $P(\hat{\theta} > 0 \mid \theta > 0)$] and the proportion of
correctly identified simulees with $\hat{\theta} \le 0$ among those with $\theta \le 0$ [i.e., $P(\hat{\theta} \le 0 \mid \theta \le 0)$].
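
The generating design can be reproduced with a few lines of Python. The sketch below
builds the crossed item parameters for both test lengths and draws abilities from a
standard normal distribution; the dictionary layout, the function name, and the random
seed are assumptions made for illustration.

```python
import random


def item_parameters(k):
    """Item parameters for test length k = 30 or k = 60 as described above."""
    if k not in (30, 60):
        raise ValueError("only k = 30 and k = 60 are defined in this design")
    n_difficulties = k // 3
    step = 0.40 if k == 30 else 0.20
    items = []
    for alpha in (0.5, 1.0, 1.5):
        for i in range(1, n_difficulties + 1):
            beta = round(-2.00 + step * (i - 1), 2)
            items.append({"alpha": alpha, "beta": beta, "gamma": 0.20})
    return items


rng = random.Random(0)  # arbitrary seed, for reproducibility of the sketch only
items_30 = item_parameters(30)
items_60 = item_parameters(60)
thetas = [rng.gauss(0.0, 1.0) for _ in range(400)]  # abilities ~ N(0, 1)
print(len(items_30), len(items_60), len(thetas))
```

The difficulties therefore run from -2.0 to 1.6 for k = 30 and from -2.0 to 1.8 for k = 60.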

Guessing 

Two sample sizes were used: n = 400 with 40 misfitting simulees, and n = 1000 with
100 misfitting simulees. Guessing was simulated on 1/6, 1/3, and 1/2 of the easy items
in the test. The probability of a correct response to these items was chosen to be 0.20.
For every condition, 100 replications were made with a nominal significance probability 
of 0.05 for every person-fit statistic. 
To investigate the robustness of the classification decisions, we first determined the
proportion of correct mastery decisions in a group with only fitting item response patterns.
Then, we determined the proportion of correct mastery decisions in a group with fitting and
misfitting simulees where the item parameters were estimated using both the fitting and
misfitting simulees. Finally, we determined the proportions of correct mastery decisions
in the groups of simulees classified as fitting and misfitting on the basis of a person-fit
statistic.

Item Disclosure 

The setup of the simulation study for item disclosure was analogous to the study for
guessing. Data were generated for sample sizes of n = 400 and n = 1000 simulees,
and test lengths of k = 30 and k = 60 items. Item disclosure was simulated for 10% of 
the simulees, and for these simulees, 1/6, 1/3, and 1/2 of the difficult items in the test 
were affected. The probability of a correct response to these items was chosen to be 0.80. 
Test statistics were computed in the same way as in the guessing study. Again, in every 
condition, 100 replications were made with a nominal significance probability of 0.05 for 
every person-fit statistic. 

Again, the proportion of correct mastery decisions without the presence of misfitting 
simulees was used as a base rate. Furthermore, we determined the proportion of 
correct mastery decisions for fitting or misfitting simulees, and for fitting and non-fitting 
simulees as identified using person-fit statistics, respectively.

The MCMC procedure 

For the MCMC procedure, a run length of 4000 iterations with a burn-in period of
1000 iterations was chosen (see Albert, 1992). That is, the first 1000 iterations were
discarded. In the remaining 3000 iterations, $T(y^{rep}, \xi)$ and $T(y, \xi)$ were computed every
fifth iteration. So the posterior predictive checks were based on 600 draws. For the
statistics that use a partitioning of the items into subtests, the items were ordered according
to their item difficulty $\beta$ and then two subtests of equal size were formed, one with the
difficult and one with the easy items.
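
The bookkeeping of this setup is easy to check. The sketch below lists the retained
iterations and forms the two subtests; which of the five iterations in each block is
retained is not stated in the text, so the zero-based indexing used here is an assumption,
as are the names.

```python
RUN_LENGTH, BURN_IN, THIN = 4000, 1000, 5

# Iterations (0-based) at which the person-fit statistics are evaluated.
retained = list(range(BURN_IN, RUN_LENGTH, THIN))


def split_by_difficulty(betas):
    """Return (easy, difficult): index lists of two equal-sized subtests."""
    order = sorted(range(len(betas)), key=lambda j: betas[j])
    half = len(order) // 2
    return order[:half], order[half:]


# Difficulties of the k = 30 design: ten values, each used three times.
betas = [round(-2.0 + 0.4 * i, 2) for i in range(10)] * 3
easy, difficult = split_by_difficulty(betas)
print(len(retained), len(easy), len(difficult))
```

With 3000 post-burn-in iterations and a thinning interval of five, 600 draws remain, in
agreement with the number reported above.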
Results

Guessing 

For a sample with all item score patterns fitting, the proportion of correct mastery
decisions for n = 400 and n = 1000 simulees with k = 30 and k = 60 items is given
in Table 1. We divided the simulees into two groups based on $\theta$: groups with $\theta \le 0$ and
$\theta > 0$. Recall that in this setup, the proportion of correct mastery decisions is defined as
the conditional probability $P(\hat{\theta} \le 0 \mid \theta \le 0)$ for $\theta \le 0$ and as $P(\hat{\theta} > 0 \mid \theta > 0)$
for $\theta > 0$. For example, in Table 1 for the combination n = 400 and
k = 30, the proportion of correct mastery decisions using the MCMC method is 0.89 for
$\theta \le 0$. This means that using the MCMC method, 89% of the simulees in the group with
$\theta \le 0$ have $\theta$ estimates less than or equal to zero.

Insert Table 1 about here 

Comparing the proportions of correct mastery decisions in Table 1, it can be seen 
that there are no main effects of mastery ($\theta \le 0$ versus $\theta > 0$) and estimation method.
Furthermore, the proportion of correct mastery decisions is little affected by the sample 
size, that is, by the precision of the item parameter estimates. There is, however, a main 
effect of test length, that is, all proportions for k = 30 are less than those for k = 60. 
This result is as expected because using longer tests will result in a higher proportion of 
correct mastery decisions for the normal simulees. 

Insert Table 2 about here 

Table 2 gives the proportions of correct mastery decisions for data sets with 10%
guessing simulees. The item parameters were estimated using the data of fitting
and misfitting simulees simultaneously. Comparing the proportion of correct mastery
decisions for n = 400 in the normal (non-aberrant) group with $\theta \le 0$ and with $\theta > 0$
(Table 2, upper panel), the proportion of correct mastery decisions in the normal group
with $\theta > 0$ is higher across all estimation methods for p = 1/6, p = 1/3, and p = 1/2
(where p denotes the proportion of items affected by guessing)
than in the normal group with $\theta \le 0$. There is one exception for the MCMC method,
where for $\theta > 0$ the proportion is 0.71 and for $\theta \le 0$ it is 0.85. However, for the guessing
simulees, the proportion of correct mastery decisions in the group with $\theta > 0$ is always
smaller than in the group with $\theta \le 0$ for p = 1/6, p = 1/3, and p = 1/2. In general, for
p = 1/2 the proportion of correct mastery decisions in the group with $\theta \le 0$ is almost equal to
one. This can be explained by noting that guessing is imposed on the easy items, resulting
in a lower $\hat{\theta}$ than the true $\theta$.

For both test lengths, k = 30 and k = 60, it can be seen that if p increases, the
proportion of correct mastery decisions decreases in the normal group with $\theta \le 0$ and
increases in the normal group with $\theta > 0$. In contrast, for the guessing simulees the
proportion of correct mastery decisions increases for $\theta \le 0$ and decreases for $\theta > 0$.
This is due to the fact that when p increases and guessing is imposed on easy items, given
a fixed score s, the probability of obtaining a score higher than s decreases. Thus, when $\theta > 0$
it will result in a lower $\hat{\theta}$, which implies that the number of guessing simulees that are
misclassified increases as p increases.

The results across the different estimation methods for the normal simulees are
similar across test lengths and proportions of simulated guessing p. For guessing simulees
with $\theta \le 0$, results are also comparable across estimation methods. However, for guessing
simulees with $\theta > 0$ and p = 1/3 or p = 1/2, the proportions of correct classifications differ
substantially, with MCMC as the least effective estimation method. In the latter case, the
estimates of $\theta$ are distorted to such a degree that the proportion of correct classifications
approaches zero.

Inspection of the results in the condition with n = 1000 (Table 2, lower panel) and 
comparing these results with n = 400 (Table 2, upper panel) shows that the proportion of 
correct mastery decisions is little affected by the smaller calibration sample. 

Insert Tables 3 and 4 about here 
In realistic situations, it is unknown which respondents are aberrant. The objective
of the following simulation study is to assess whether the same precision for mastery
classifications can be attained when persons are classified as aberrant or non-aberrant
using person-fit statistics. The results are given in Table 3 (k = 30, n = 400) and Table
4 (k = 60, n = 400). Comparing classification rates for both k = 30 and k = 60 with
the classification rates in Table 2, it can be concluded that the classification rates for the
normal simulees based on a priori knowledge (Table 2) and based on the classification on
the basis of a person-fit statistic are similar. In the group of guessing simulees, it can be
seen that the proportion of correct mastery decisions using the person-fit statistics is lower
than in Table 2 for $\theta \le 0$, whereas it is generally higher for $\theta > 0$ across p, that is, the
extent to which the test is affected by guessing. This last result may be explained by noting
that the group of simulees classified as misfitting consists of both simulated guessing
simulees and normal simulees. Because the normal simulees have in general a higher $\hat{\theta}$
value than the guessing simulees, the presence of some normal simulees may increase
the average $\hat{\theta}$ value in the guessing group. Furthermore, for p = 1/6, the proportion of
correct mastery decisions is higher than those obtained for p = 1/3 and p = 1/2.

In Table 4, the results for k = 60 are depicted. In general, the proportion of correct 
mastery decisions is higher for k = 60 than for k = 30. For n = 1000 (not tabulated), 
analogous trends were found. 

Item Disclosure 

The proportion of correct mastery decisions for k = 30 and k = 60
based on a priori knowledge, that is, without using person-fit statistics, is given in Table
5 for n = 400 (upper panel) and n = 1000 (lower panel).

Insert Table 5 about here 

Comparing the proportion of correct mastery decisions in the normal (non-aberrant)
group with $\theta \le 0$ and with $\theta > 0$ for n = 400 (Table 5, upper panel), the proportion of
correct mastery decisions for $\theta \le 0$ is a little higher than for $\theta > 0$ across all estimation
methods and proportions of item disclosure. Comparing the results with Table 1, it can
be seen that the proportions are almost identical and that there is no effect of bias in the
item parameter estimates as a result of the presence of 10% aberrant simulees. For the
item disclosure simulees, the proportion of correct mastery decisions in the group with
$\theta \le 0$ is always smaller than in the group with $\theta > 0$. Because item disclosure will lead,
in general, to a higher $\hat{\theta}$ value, the proportion of correct mastery decisions for $\theta \le 0$ is
reduced. In general, the proportion of correct mastery decisions in the group with $\theta > 0$ is
almost equal to one.

Across the different p values, for k = 30 and k = 60, it can be seen that for the normal
simulees the proportions of correct mastery decisions for $\theta \le 0$ and for $\theta > 0$ are similar.
In contrast, for the item
disclosure simulees, the proportion of correct mastery decisions decreases for $\theta \le 0$ and
increases for $\theta > 0$. This is due to the fact that when p increases and item disclosure is
imposed on difficult items, given a fixed score s, the probability of obtaining a score higher than
s increases. Thus, when $\theta \le 0$ it will result in a higher $\hat{\theta}$, which implies that the number
of item disclosure simulees that are misclassified increases as p increases.

With respect to the estimation methods, it can be seen that for normal simulees, there
are only small (inconsistent) differences. For item disclosure simulees with $\theta > 0$, results
are also comparable across estimation methods. However, for item disclosure simulees
with $\theta \le 0$ and p = 1/3 or p = 1/2, the MCMC method performed better than the EAP
and MLE methods.

Inspecting the results in the condition with n = 1000 (Table 5, lower panel) and 
comparing them to the results for n = 400 (Table 5, upper panel) shows again that the 
proportion of correct mastery decisions is little affected by the smaller calibration sample. 

Insert Tables 6 and 7 about here 

In Tables 6 and 7, the classification percentages using the person-fit statistics for 
n = 400 and k = 30 (Table 6) and k = 60 (Table 7) are given. From Table 6 it is clear that 
for normal simulees the different estimation methods result in almost the same percentages 
of correct classifications across the different p values for θ ≤ 0 and θ > 0. Also, the 
percentages are similar to those in Table 5, where the normal simulees were identified 
based on a priori knowledge. For the item disclosure simulees, the proportions of correct 
classifications differ. Comparing the classification rates in Table 5 and Table 6, we note 
that in the group of item disclosure simulees the proportion of correct mastery decisions 
using the person-fit statistics is, in general, higher than in Table 5 for θ < 0, whereas it 
is generally lower for θ > 0 across different p values. This last result may be explained 
by noting that the group of simulees classified as misfitting consists of both simulated 
item disclosure and normal simulees. Because the normal simulees have in general a lower 
θ value than the item disclosure simulees, the presence of some normal simulees may 
reduce the average θ value in the item disclosure group. No large differences were found 
across person-fit statistics. 
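
The dilution argument above can be sketched numerically. The following minimal example assumes a small set of hypothetical θ values in which one normal simulee is flagged as misfitting along with three item disclosure simulees. The function name and all values are invented for illustration only.

```python
from statistics import mean


def flagged_group_means(theta, is_disclosure, is_flagged):
    """Mean theta of all flagged simulees and of the flagged true disclosure simulees."""
    flagged = [t for t, f in zip(theta, is_flagged) if f]
    flagged_true = [
        t for t, d, f in zip(theta, is_disclosure, is_flagged) if f and d
    ]
    return mean(flagged), mean(flagged_true)


# Hypothetical values: three disclosure simulees followed by three normal simulees.
theta = [1.2, 0.9, 1.5, 0.2, 0.1, -0.3]
is_disclosure = [True, True, True, False, False, False]
# Assumption: the person-fit statistic flags one normal simulee as well.
is_flagged = [True, True, True, True, False, False]

mixed_mean, disclosure_mean = flagged_group_means(theta, is_disclosure, is_flagged)
print(mixed_mean, disclosure_mean)
```

With these values the mean of the flagged group is roughly 0.95, against roughly 1.2 for the true item disclosure simulees alone, so a single flagged normal simulee is enough to lower the group average.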

In Table 7, the results for k = 60 are given. In general, the proportion of correct 
mastery decisions is higher for k = 60 than for k = 30. For n = 1000 (not tabulated) the 
same trends were found and the classification percentages were similar to those in Table 6 
and Table 7. 



Discussion 

The effect of person misfit to an IRT model on a mastery/non-mastery decision was 
investigated, as was the question of whether person-fit statistics can be helpful in 
judging the acceptability of such a decision. The results showed the following. 

(1) The presence of 10% misfitting simulees had little effect on the item 
parameter estimates, in the sense that the mastery classification of normal simulees was 
little affected. 

(2) The classification precision of aberrant simulees was greatly affected: the 
precision for guessing simulees with θ > 0 became virtually zero, especially when the 
MCMC method was applied. For item disclosure simulees with θ < 0, the effects were 
less dramatic than for guessing, although for p = 1/2 the classification rates were 
substantially lower than for normal simulees. 

(3) For simulees classified as normal by means of a person-fit statistic, the 
classification rates were comparable with the situation where there was a priori knowledge 
about normal/aberrant behavior. 

In general the effect of the type of estimation method and the type of person-fit 
statistic was small, though the ^-statistic generally gave the best results. The MCMC 
method seemed to result in greater classification precision in the case of item disclosure. 
In the case of guessing, and in cases where relatively small numbers of items were 
misfitting (p = 1/6 and p = 1/3), the EAP method performed better. However, there 
was no overall trend in which one estimation method outperformed the others. 

The main conclusion is that the classification precision in the sub-sample identified 
as normal (non-aberrant) by a person-fit statistic is comparable to the classification 
precision that could be attained if aberrant and non-aberrant respondents were known in 
advance. Classification precision for aberrant persons is erratic. This suggests that 
further research might be directed at methods for identifying a subset of items on which 
an aberrant person gives model-conforming responses and using this subset to estimate θ. 
An additional research question will then be to what extent the resulting reduction in 
test length decreases the precision of the estimate of θ and the ensuing classification 
precision. 
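
As a rough illustration of the kind of procedure suggested here, the sketch below estimates θ by a simple grid search, once from all items and once from a retained subset. It is not the method used in this study. It assumes a three-parameter normal ogive response function of the form c + (1 - c)Φ(aθ - b), and the item parameters, the response pattern, and the choice of which items to retain are hypothetical.

```python
import math


def p_correct(theta, a, b, c):
    """Assumed 3PNO form: c + (1 - c) * Phi(a * theta - b)."""
    phi = 0.5 * (1.0 + math.erf((a * theta - b) / math.sqrt(2.0)))
    return c + (1.0 - c) * phi


def grid_mle(responses, items, keep=None, lo=-4.0, hi=4.0, step=0.01):
    """Grid-search ML estimate of theta using only the item indices in `keep`."""
    idx = list(range(len(items))) if keep is None else list(keep)
    best_theta, best_ll = lo, -math.inf
    n_steps = int(round((hi - lo) / step))
    for g in range(n_steps + 1):
        theta = lo + g * step
        ll = 0.0
        for i in idx:
            # Clamp to keep the logarithms finite at extreme theta values.
            p = min(max(p_correct(theta, *items[i]), 1e-12), 1.0 - 1e-12)
            ll += math.log(p) if responses[i] else math.log(1.0 - p)
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta


# Hypothetical (a, b, c) parameters; the last two items are the difficult ones.
items = [(1.0, -1.0, 0.2), (1.0, -0.5, 0.2), (1.0, 0.0, 0.2),
         (1.0, 0.5, 0.2), (1.0, 1.5, 0.2), (1.0, 2.0, 0.2)]
responses = [1, 1, 0, 0, 1, 1]  # correct answers on the two hardest items

theta_all = grid_mle(responses, items)
# Assumption: the two hardest items have been identified as disclosed.
theta_subset = grid_mle(responses, items, keep=[0, 1, 2, 3])
print(theta_all, theta_subset)
```

Dropping the two suspect items cannot raise the estimate in this example, because correct responses only add terms to the log-likelihood that increase with θ. The price is that the estimate rests on four items instead of six, which is the loss of precision raised in the last sentence above.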
Table 1 

The Proportion of Correct Mastery Decisions. 
All Item Score Patterns Fitting. 

                        n = 400             n = 1000 
          Method    θ ≤ 0    θ > 0      θ ≤ 0    θ > 0 

k = 30    MCMC      0.89     0.89       0.90     0.90 
          EAP       0.89     0.89       0.90     0.89 
          MLE       0.88     0.89       0.89     0.90 

k = 60    MCMC      0.93     0.93       0.93     0.93 
          EAP       0.93     0.92       0.93     0.93 
          MLE       0.93     0.92       0.93     0.93 
W 0.89 0.96 0.73 0.46 0.85 0.96 0.49 0.06 0.85 0.98 0.47 0.24 

UB 0.89 0.96 0.72 0.46 0.85 0.96 0.49 0.06 0.84 0.98 0.46 0.24 

Ci 0.89 0.96 0.63 0.61 0.85 0.97 0.64 0.09 0.85 0.98 0.65 0.33 

C 9 0.89 0.96 0.89 0.45 0.85 0.97 0.93 0.06 0.84 0.98 0.96 0.24 